Predictive algorithms play a crucial role in
systems management by alerting the user to 
potential failures. We report on three case 
studies dealing with the prediction of failures 
in computer systems: (1) long-term prediction 
of performance variables (e.g., disk utilization), 
(2) short-term prediction of abnormal behavior 
(e.g., threshold violations), and (3) short-term 
prediction of system events (e.g., router 
failure). Empirical results show that predictive 
algorithms can be successfully employed in 
the estimation of performance variables and 
the prediction of critical events. 


An important characteristic of an intelligent agent
is its ability to learn from previous experience in
order to predict future events. The mechanization of
the learning process by computer algorithms has led
to a vast amount of research on the construction of
predictive algorithms. In this paper, we narrow our
attention to the realm of computer systems; we
demonstrate how predictive algorithms enable us to
anticipate the occurrence of events of interest related
to system failures, such as CPU overload, threshold
violations, and slow response times.

Predictive algorithms can play a crucial role in
systems management. The ability to predict service
problems in computer networks, and to respond to
those warnings by applying corrective actions, brings
multiple benefits. First, detecting system failures on
a few servers can prevent the spread of those
failures to the entire network. For example, slow
response time on a server may gradually escalate to
technical difficulties on all nodes attempting to
communicate with that server. Second, prediction can
be used to ensure continuous provision of network
services through the automatic implementation of
corrective actions. For example, prediction of high
CPU demand on a server can initiate a process to
balance the CPU load by rerouting new demands to a
back-up server.

Several types of questions are often raised in the area 
of computer systems: 

• What will be the disk utilization or CPU utilization 
next month (next year)? 

• What will be the server workload in the next hour 
(n minutes)? 

• Can we predict a severe system event (e.g., router 
failure) in the next n minutes? 

The questions above differ in two main aspects: time
horizon and object of prediction. The former
characterizes our ability to perform short-term or
long-term predictions and has a direct bearing on the kind
of corrective actions one can apply. Any action
requiring human intervention requires at least several
hours, but if actions are automated, minutes or even
seconds may suffice. The latter relates to the
outcome of a prediction and can be either a numeric
variable (e.g., amount of disk utilization) or a
categorical event (e.g., router failure).

Both time horizon and object of prediction are
important factors in deciding which predictive
algorithm to use. In this paper, we present three major
predictive algorithms addressing the following
problems: (1) long-term prediction of performance
variables (e.g., disk utilization), (2) short-term
prediction of abnormal behavior (e.g., threshold violations),
and (3) short-term prediction of system events (e.g.,
router failure). The first problem is solved using a
regression-based approach. A salient characteristic
of a regression algorithm is the ability to form a
piecewise model of the time series that can capture
patterns occurring at different points in time. The
second problem employs time-series analysis to predict
abnormal behavior (e.g., threshold violations);
prediction is achieved through a form of hypothesis
testing. The third problem predicts critical events by
using data-mining techniques to search for patterns
frequently occurring before these events.

Our goal in this paper is to provide some criteria in
the selection of predictive algorithms. We proceed
by matching problem characteristics (e.g., time
horizon and object of prediction) with the right
predictive algorithm. We use our selection criteria in
three case studies corresponding to the problems
described above.

Extensive work has been conducted in the past
trying to predict computer performance. For example,
work is reported in the prediction of network
performance to support dynamic scheduling, 1 in the
prediction of network traffic, 2 and in the production of
a branch predictor to improve the performance of
a deep pipelined micro-architecture. 3 Other
studies reported in the literature 4-7 focus on predicting
at the instruction level, whereas we focus on
predicting at the system and event level (e.g., response time,
CPU utilization, network node down, etc.). A
common approach to performance prediction proceeds
analytically, by relying on specific performance
models; one example is in the study of prediction models
at the source code level, which plays an important
role for compiler optimization, programming
environments, and debugging tools. 8 Our view of the
prediction problem is mainly driven by historical data
(i.e., is data-based). Many studies have tried to bridge
the gap between a model-based approach and a
data-based approach. 9

The rest of the paper is organized as follows. First
we provide a general view of prediction algorithms
and describe our approach to selecting an algorithm
for the problem at hand. In the following section we
discuss algorithms for long-term prediction of
computer performance. Next we discuss an algorithm for
detecting threshold violations of workload demands,
and then we describe our approach to the
prediction of system events. We list our conclusions in the
last section.

Prediction in computer networks 

We begin by giving a general view of the prediction
problem. We then provide some criteria for
selecting a predictive algorithm to use, based on the
characteristics of the problem at hand.

A formulation of the prediction problem. To make
predictions, one needs access to historical
information. We define historical information as an ordered
collection of data, $S_i$, that starts at a point in time
$t_1$, covering events up to a final time $t_i$. Specifically,
$S_i = \{s_j\}$, $1 \le j \le i$, where the $j$th element is a
pair $s_j = (v_j, t_j)$. The first element of the pair, $v_j$,
indicates the value of one or more variables of
interest, whereas the second element of the pair, $t_j$,
indicates its occurrence time. The elements of $S_i$ are
ordered, that is, $t_j < t_k$ for any $j < k$.

As an example, assume monitoring systems capture
the disk utilization on a server at five-minute
intervals during an experiment of one hour. In this case,
the historical information is the collection of pairs
$\{(v_j, t_j)\}_{j=1}^{i}$, where $v_j$ is the disk utilization at time
$t_j$, and time increases in five-minute steps. In some
cases we want to capture the values associated with
several variables at time $t_j$, where the first element
of each event pair is a vector $\mathbf{v}_j$. For example,
$\mathbf{v}_j = (v_{j1}, v_{j2}, v_{j3})$, where the values on the right
represent the disk utilization, the memory utilization, and
the CPU utilization at time $t_j$, respectively. We
collect data up to a point in time $t_i$, and the goal is to
predict the disk utilization at a time $t_{i+k}$ (i.e., k steps
in the future).
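The formulation above can be made concrete with a small data structure. The following is only a sketch: the names Observation and is_ordered, and the sample readings, are ours and purely illustrative.

```python
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    """One element s_j = (v_j, t_j) of the historical information S_i."""
    values: tuple  # v_j: one or more variables of interest
    time: float    # t_j: occurrence time (here, minutes since the start)


def is_ordered(history: Sequence[Observation]) -> bool:
    """Check the ordering requirement t_j < t_k for every j < k."""
    return all(a.time < b.time for a, b in zip(history, history[1:]))


# Hypothetical readings: (disk, memory, CPU) utilization every five minutes.
history = [
    Observation((0.61, 0.40, 0.22), 0.0),
    Observation((0.62, 0.43, 0.35), 5.0),
    Observation((0.62, 0.41, 0.30), 10.0),
]
```

Checking consecutive pairs is enough here, because a strict ordering of neighbors implies a strict ordering of every pair.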

A prediction is an estimation of the value of a
variable $v_{i+k}$ occurring at time $t_{i+k}$ in the future,
conditioned on historical information $S_i$. Hence, a
prediction is the output of a generic function
conditioned on $S_i$, $v_{i+k} = g(S_i) + \epsilon_i$, in which $g(\cdot)$ is
a function capturing the predictable component and
$\epsilon_i$ models the possible noise.

Normally, the further out our prediction, the less
accurate the result. Hence, a predicted value is ideally
accompanied by a probability term that reflects our
degree of confidence. This confidence can be
measured by a conditional probability, $P(v_{i+k} \mid S_i)$.
Although determining the conditional probability
$P(v_{i+k} \mid S_i)$ is always desirable, it is not always
possible.

Figure 1 A homogeneous (A) and a nonhomogeneous (B) time series

Data characteristics. Variables of interest whose
values we might want to predict include the memory
utilization or disk utilization on a host or group of
hosts, the number of HyperText Transfer Protocol
(HTTP) operations per second in a Web server, and
the status of a network node (up or down) at a given
time. In all these cases we rely on historical
information to anticipate future behavior. We wish to
emphasize the temporal component of this information:
knowing when an event occurred in the past is as
important as its nature.

A first step in prediction is looking for a technique
matching the characteristics of the problem.
Important factors are the discrete or continuous nature of
the data, whether observations are taken at equal
time intervals or not, and whether the data are
aggregated over time intervals or correspond to
instantaneous values. For example, most techniques based
on time series analysis deal with discrete
observations taken at equal time intervals.

We characterize historical information along two
dimensions: data type and sampling frequency. Data
types can be either numeric (e.g., memory
utilization is 80 percent) or categorical (e.g., event type is
“router failure”). The sampling mechanism depends
on the data collecting method and is either periodic
sampling (i.e., equal time intervals) or triggered
sampling (i.e., data collected when a predefined
condition is satisfied). Data collected by periodic sampling
include performance measurements such as
utilization and end-to-end response time to a probing agent
(e.g., “ping” and mail server probe).

Prediction techniques. Once the problem is well
characterized, there are often a wide variety of
prediction techniques available. In some cases we rely
on classical time-series analysis, whereas in other
cases we employ data-mining techniques. An
important factor that differentiates among techniques is
whether or not the model is homogeneous through
time. A homogeneous model captures key
characteristics of the time series such as the general trend,
seasonal variation, and variation in the stationary
residuals. 10 Figure 1A shows the general trend and
seasonal variation on a time series. The general trend
could correspond to a constant rate of increase of
the CPU utilization on a server over months, whereas
the seasonal variation could reflect some, say,
monthly activity particular to the customer. If one
were to remove these variations from the data, the
result would be a stationary time series (as explained
later in this paper). When these variations are
present, the general assumption is that they persist
throughout the entire time interval.

We also consider the case where key characteristics
of the time series vary significantly depending on
the time and the state of the system being modeled.
Figure 1B shows a scenario where both trend and
seasonal effects are not constant through time. It is
under these conditions that using new techniques can
add flexibility to the prediction process, and as we
show later, this flexibility often results in improved
accuracy.

Table 1 Selecting a predictive algorithm based on domain
properties

               Short-Term Predictions    Long-Term Predictions
Numeric Data   Stationary models         Trend and seasonal
               for time series           analysis
Boolean Data   Data mining               Periodicity analysis,
                                         failure model

Selection criteria. Selecting the right predictive
algorithm depends on at least two factors: the time of
the prediction and the type of data. The first factor
can be divided into short-term prediction and
long-term prediction. A difficulty inherent to this
differentiation is to ascribe a precise meaning to both
terms. In the context of computer systems, it is
reasonable to assume short-term prediction in the range
of minutes or hours, and long-term prediction in the
range of days, weeks, or months. The second factor
can be either numeric or Boolean. In some cases it
is also important to note if the observations were
made at equal time intervals or not.

Table 1 presents our selection guidelines for a
prediction technique based on the factors above.
Long-term predictions of numeric data need to consider
the general trend and seasonal variations. The
general trend measures the long-term change in the
mean level, whereas seasonal variations are normally
the result of long-term fluctuations. Trend and
seasonal variations normally account for most of the
long-term behavior of a time series. We exemplify
this case in the next section.

Short-term predictions of numeric variables are
attained by applying classical time-series analysis over
stationary data. The data obtained by removing the
general trend (or mean) and the seasonal variations
are usually stationary (no systematic change is
detected). For equally spaced sampling of data, one
can use models such as autoregressive processes and
moving averages. 10 We exemplify this case in the
section “Predicting threshold violations.”

On the other hand, short-term predictions of
categorical variables (not necessarily from equally
spaced data) are attainable by relying on
data-mining techniques. Recent years have seen an
explosion in the study of data-mining techniques
looking for different forms of temporal patterns. 11-14 A
common technique is to find frequent subsequences
of events in the data. An additional step, however,
is needed to integrate these patterns into a model
for prediction. 15-18 We exemplify this case in the
section “Predicting target events in production
networks.”

Our last scenario deals with long-term predictions
of Boolean data, for which there are two different
methods available: use of periodic patterns and use
of failure models. Periodic patterns are represented
as sets of events occurring at regular intervals of
time. 19 In computer networks, for example, a
periodic pattern may correspond to high CPU utilization
on several servers at regular time intervals due to
scheduled maintenance. Periodic patterns may be
removed if they reflect normal behavior, or used for
prediction if signaling uncommon situations.
Failure models have been used to model device
lifetimes. 20 Modeling involves a mathematical equation
(e.g., Poisson life span, independent failures) that
must capture the true data distribution.
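The guidelines of Table 1 amount to a lookup on two keys. The sketch below is only an illustration; the function names and the 24-hour cutoff between short-term and long-term are our assumptions (the text only places short-term in the range of minutes or hours and long-term in the range of days or more).

```python
SHORT_TERM, LONG_TERM = "short-term", "long-term"
NUMERIC, BOOLEAN = "numeric", "boolean"

# Table 1 as a mapping from (data type, time horizon) to a technique family.
TABLE_1 = {
    (NUMERIC, SHORT_TERM): "stationary models for time series",
    (NUMERIC, LONG_TERM): "trend and seasonal analysis",
    (BOOLEAN, SHORT_TERM): "data mining",
    (BOOLEAN, LONG_TERM): "periodicity analysis or failure model",
}


def horizon(hours_ahead: float, cutoff_hours: float = 24.0) -> str:
    """Classify a prediction horizon; the cutoff is an assumed value."""
    return SHORT_TERM if hours_ahead < cutoff_hours else LONG_TERM


def select_technique(data_type: str, hours_ahead: float) -> str:
    """Return the technique family that Table 1 suggests."""
    return TABLE_1[(data_type, horizon(hours_ahead))]
```

For instance, select_technique("numeric", 0.5) falls in the short-term numeric cell, which corresponds to the threshold-violation study described later.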

The next three sections describe real applications 
that exemplify our choice of predictive algorithms. 
We focus on the first three cases described above in 
the context of computer systems. 

Predicting computer performance 

Our first study deals with the problem of long-term
predictions on numeric data (Table 1). We wish to
predict performance parameters, such as response
time or disk utilization, for capacity planning. 21
Estimating the future value of a performance
parameter is helpful in assessing the need to acquire
additional devices (e.g., extra memory or disk space)
to ensure continuity in all network services. Here we
focus on the nonstationary components of the time
series. In particular, we look at the general trend.
We overview two different approaches: the
traditional k-step extrapolation, and learning to map k
steps ahead. We look at each approach in turn.

Extrapolation. A familiar approach to the prediction
of the general trend is to fit a model to the data and
then to extrapolate k steps ahead in time. For
example, a simple technique is to assume the data
follow a linear trend plus noise of the form

$$v_j = \beta_1 + \beta_2 t_j + \epsilon_j, \quad 1 \le j \le i \qquad (1)$$

where $\beta_1$ and $\beta_2$ are constants and $\epsilon_j$ is a random
variable with zero mean. Prediction is simply a
matter of projecting Equation 1 to find the value of $v_{i+k}$.
For example, applying a linear least-squares
regression over disk utilization vs time on a computer server
where data are aggregated on a monthly basis can
indicate a critical threshold will be reached in
approximately five months. In some cases we may find
we can do a better job by fitting the data using
polynomial curves.
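As a minimal sketch of extrapolation with Equation 1, the code below fits the two constants by ordinary least squares and projects the line forward until it reaches a threshold. The function names, the monthly readings, and the 90 percent threshold are invented for illustration.

```python
def fit_linear_trend(times, values):
    """Least-squares estimates (beta1, beta2) for v_j = beta1 + beta2 * t_j."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    beta2 = sxy / sxx
    beta1 = mean_v - beta2 * mean_t
    return beta1, beta2


def extrapolate(beta1, beta2, t):
    """Project the fitted trend to time t."""
    return beta1 + beta2 * t


def time_to_threshold(beta1, beta2, threshold):
    """Time at which the trend reaches the threshold, or None if it never rises."""
    if beta2 <= 0:
        return None
    return (threshold - beta1) / beta2


# Hypothetical monthly disk utilization (percent), growing 4 points a month.
months = [1, 2, 3, 4, 5, 6]
disk = [50.0, 54.0, 58.0, 62.0, 66.0, 70.0]
beta1, beta2 = fit_linear_trend(months, disk)
crossing = time_to_threshold(beta1, beta2, 90.0)
months_left = crossing - months[-1]
```

With these made-up readings the fitted line reaches 90 percent five months after the last observation, the kind of statement the example in the text refers to.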

Learning to predict k steps ahead. A different
perspective to the prediction problem uses concepts
from machine learning. Instead of fitting a model to
our historical data we could try to learn a mapping
between the state of a computer at time $t_j$ and the
state of the computer at time $t_{j+k}$. The mapping can
be described by the following equation:

$$v_{j+k} = g(v_j) + \epsilon_j \qquad (2)$$

In other words, we try to learn how to estimate the
value of the performance variable k steps in the
future by creating a set of pairs $\{(v_j, v_{j+k})\}$, using our
historical data (i.e., match each measurement with
the measurement k steps ahead). The problem is now
transformed into that of function approximation: we
want to approximate the function that generated
these points. Learning this mapping gives a direct
model for prediction, which we can use to estimate
$v_{i+k}$, where $v_i$ is the last observation in $S_i$. The
nature of g can take on different forms: it can be
represented as a linear function of $v_j$, as a decision tree,
a neural network, etc. We do not restrict the nature
of function g to a specific form, but simply indicate
its functionality: to map computer states from time
$t_j$ to time $t_{j+k}$.

Approximating function g obviates any form of
extrapolation. The difference with extrapolation is that
in this formulation we need to approximate a
different function g for each value of k. The advantage
lies in the flexibility built into the model: it enables
us to deal with time series where key characteristics
may vary significantly through time (see the
subsection “Prediction techniques,” earlier).

The general approach to learning to perform
predictions is to transform the original database to reflect
the mapping between a state at time $t_j$ and a state
at time $t_{j+k}$. The idea is to cast the prediction
problem into a learning problem. In a learning problem,
the input data are normally represented in tabular
form, where each record represents an example
characterized by features, and the last column is the
target class to which the example belongs. A numeric
class calls for regression methods (as in our case),
whereas a categorical or nominal class calls for
classification methods. The goal is to learn how to map
feature values into class values in order to predict
the class of new examples. 22,23

Returning to the problem of predicting computer
performance, remember our historical data are an
ordered collection of events. Each event can play the
role of an example characterized by one or multiple
performance variables at time $t_j$. The target class of
the example corresponds to the value of the
predictive variable at time $t_{j+k}$ (in order to learn to predict
the value of the performance variable k steps in the
future).
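The transformation of a time series into a learning problem can be sketched as follows. We assume a single performance variable and take g to be a straight line, which is only one of the forms mentioned above; the helper names and the toy series are ours.

```python
def make_training_pairs(series, k):
    """Pair each measurement v_j with the measurement k steps ahead, v_{j+k}."""
    return [(series[j], series[j + k]) for j in range(len(series) - k)]


def fit_linear_map(pairs):
    """Least-squares fit of g(v) = a + b * v to the (v_j, v_{j+k}) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all inputs are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_k_ahead(series, k):
    """Estimate v_{i+k} by applying the learned g to the last observation v_i."""
    a, b = fit_linear_map(make_training_pairs(series, k))
    return a + b * series[-1]


# Toy series growing by 2 per step; with k = 2 the learned map is g(v) = v + 4.
series = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
pairs = make_training_pairs(series, 2)
forecast = predict_k_ahead(series, 2)
```

A separate map has to be fitted for every value of k, so predicting one step and six steps ahead means calling fit_linear_map on two different sets of pairs.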

Empirical findings. Different algorithms can be used
to learn the mapping mentioned above, including
linear and nonlinear regression methods and decision
trees for regression. Our experiments using these
techniques reveal two interesting findings. First,
casting the prediction problem into a learning problem
yields significant gains in accuracy compared to
traditional techniques. Our conclusions come from
experiments on a central database that contains
information on the performance of thousands of
IBM AS/400* computers. Each record in the database
reports the values of tens of performance
parameters for a particular machine and month of the year.
We form predictions for six important parameters:
response time, maximum response time, CPU
utilization, memory utilization, disk utilization, and disk
arm utilization. Figure 2 compares a multivariate
linear regression model using the learning approach
vs the extrapolation approach. We measure relative
error defined as the ratio between the error of the
multivariate linear model and the error of a simple
baseline model that takes the mean of all past
values to predict future values. For all six performance
variables under study, applying the model on the
learning approach yields significant gains in accuracy
(Figure 2).
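The relative error used in this comparison can be sketched as below. The text does not state which error measure underlies the ratio, so mean absolute error is an assumption here, as are the function names and the sample numbers.

```python
def mean_abs_error(predicted, actual):
    """Mean absolute error (an assumed choice of error measure)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_baseline(series, k):
    """Baseline: predict the value at index j + k as the mean of series[0..j]."""
    predictions = []
    for j in range(len(series) - k):
        past = series[: j + 1]
        predictions.append(sum(past) / len(past))
    return predictions


def relative_error(model_predictions, series, k):
    """Error of the model divided by the error of the mean-of-past baseline."""
    actual = series[k:]
    baseline_error = mean_abs_error(mean_baseline(series, k), actual)
    return mean_abs_error(model_predictions, actual) / baseline_error


# Hypothetical series and hypothetical one-step-ahead model predictions.
series = [10.0, 12.0, 14.0, 16.0]
model_predictions = [11.0, 14.0, 17.0]
ratio = relative_error(model_predictions, series, 1)
```

A ratio below 1 means the model beats the baseline. The ratio is undefined when the baseline happens to be perfect, a case this sketch does not handle.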

A second interesting finding is that learning from
data extracted from multiple computers of similar
architecture yields better accuracy than learning from
data extracted from a single computer. It is
reasonable to suppose that computers having similar
architecture will experience similar performance if the
overall utilization is the same. Hence, looking for
patterns across computers increases the evidential
support for correlations between performance
variables and target variables. As an example, assume
CPU utilization is a function of the memory size. Data
from a single computer may show evidence of a
positive correlation, but with high variance due to the
limited number of points available. In contrast, data
from multiple computers enable us to increase our
confidence in the quality of the model and therefore
in our predictions.

Figure 2 A comparison of multivariate linear
regression models using the learning
approach vs the extrapolation approach

Predicting threshold violations 

Our second study deals with the problem of
short-term predictions on numeric data (Table 1). We
assume the existence of Internet-attached servers that
have time-varying workloads. We describe a
systematic, statistical approach to the characterization of
normal operation for time-varying workloads, and
we apply this approach to problem detection for a
Web server by predicting threshold violations. 24

We show how our method can be used to construct
a predictive model for the purposes of workload
forecasting. We begin by developing a statistical model
of the time-varying behavior of the data and then
remove the nonstationary components. This
problem differs significantly from the study in our
previous section. Here we assume a performance line
centered around a constant value ($\mu$). Our goal is
not to predict the general trend, but to detect when
a deviation from the trend is extreme. We do this
by first removing mean and seasonal effects (i.e., the
nonstationary components). We then look to the
residuals in search of abnormal behavior.

Removing mean and seasonal effects. The data we 
consider span a time interval of eight months (June 
1996 through January 1997) from a production Web 
server at a large corporation. Data are aggregated 
over five-minute intervals. The variable of interest 
is HTTP operations per second (httpops), since this 
is an overall indicator of the demands placed on the 
Web server. 

We begin by considering the effect of time of day.
Let $v_{jd}$ be the value of httpops for the $j$th five-minute
interval (time-of-day value) and the $d$th day in the
data collected. Figure 3A plots $v_{jd}$ for a work week
(Monday through Friday) in June of 1996 and a work
week in November of 1996. The x-axis is time, and
the y-axis is httpops.

We partition $v_{jd}$ into three components: the grand
mean, the deviation from the mean due to the $j$th
time-of-day value (e.g., 9:05 A.M.), and a random
error term that captures daily variability. The grand
mean is denoted by $\mu$. The $j$th time-of-day
deviation from the grand mean is denoted by $\alpha_j$ (note that
$\sum_j \alpha_j = 0$). The error term is denoted by $\epsilon_{jd}$.

The model is:

$$v_{jd} = \mu + \alpha_j + \epsilon_{jd} \qquad (3)$$

We use the residuals to look for more patterns in
the data after time-of-day effects have been removed.
Figure 3B plots the residuals for Equation 3.
Observe that much of the rise in the middle of the day
(as evidenced in Figure 3A) has been removed.

A further examination of Figure 3B indicates that
there is a weekly pattern. Let $\beta_w$ denote the effect
of the $w$th day of the work week. As with $\alpha$, this is
a deviation from the grand mean ($\mu$). Thus,
$\sum_w \beta_w = 0$. Our extended model is:

$$v_{jdw} = \mu + \alpha_j + \beta_w + \epsilon_{jdw} \qquad (4)$$

Note that since we include another parameter (day
of week), another subscript is required for both $v$
and $\epsilon$. The residuals of this model are plotted in
Figure 3C.
Figure 3 Rate of HTTP transactions vs time for a Web server

Reprinted from Computer Networks, Vol. 35, No. 1, J. L. Hellerstein, F. Zhang, and P. Shahabuddin, “A Statistical Approach to Predictive Detection,”
pp. 77-95, Figure 2, (c) 2001, with permission from Elsevier Science.


Looking further, we observe that another pattern
remains: httpops is larger in November than it is in
June. To eliminate this, we extend our model to
consider the month. Let $\gamma_m$ denote the effect of the $m$th
month. As with $\alpha$ and $\beta$, $\sum_m \gamma_m = 0$. The model
here is:

$$v_{jdwm} = \mu + \alpha_j + \beta_w + \gamma_m + \epsilon_{jdwm} \qquad (5)$$

Once again, another subscript is added to both $v$ and
$\epsilon$.
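The successive removal of the grand mean and of the time-of-day, day-of-week, and month effects (Equations 3 through 5) can be sketched as follows. Estimating each effect as the mean residual of its level, one factor after another, is our simplifying assumption; it matches a joint fit only for balanced data. The record layout and the toy values are invented.

```python
from collections import defaultdict


def remove_effect(records, residuals, key):
    """Estimate one factor's effects as mean residual per level, then subtract."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec, r in zip(records, residuals):
        sums[rec[key]] += r
        counts[rec[key]] += 1
    effect = {level: sums[level] / counts[level] for level in sums}
    return effect, [r - effect[rec[key]] for rec, r in zip(records, residuals)]


def decompose(records, factors=("slot", "weekday", "month")):
    """Return (grand mean, per-factor effects, final residuals)."""
    values = [rec["value"] for rec in records]
    mu = sum(values) / len(values)
    residuals = [v - mu for v in values]
    effects = {}
    for factor in factors:
        effects[factor], residuals = remove_effect(records, residuals, factor)
    return mu, effects, residuals


# Toy balanced data: value = 100 + slot effect (-10 or +10) + weekday effect (-2 or +2).
records = [
    {"slot": 0, "weekday": "Mon", "month": "Jun", "value": 88.0},
    {"slot": 0, "weekday": "Tue", "month": "Jun", "value": 92.0},
    {"slot": 1, "weekday": "Mon", "month": "Jun", "value": 108.0},
    {"slot": 1, "weekday": "Tue", "month": "Jun", "value": 112.0},
]
mu, effects, residuals = decompose(records)
```

In this noise-free toy case the final residuals are all zero; with real measurements they would carry the remaining variability that the next step models.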


An autoregressive model. Until now we have been
able to account for the mean, daily, weekly, and
monthly effects. The residuals plotted in Figure 3D
still have time-serial dependencies. To remove these
dependencies, we extend the characterization in
Equation 5. We assume that the time index $t$ can be
expressed as a function of $(j, d, w, m)$. Then, we
consider the following model:

$$\epsilon_t = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + u_t \qquad (6)$$

This is a second-order autoregressive model
(AR(2)). Here, $\theta_1$ and $\theta_2$ are parameters of the
model (which are estimated from the data), and the
$u_t$ are independent and identically distributed
random variables. The model parameters are estimated
using standard techniques. 25
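One way to estimate the two parameters of Equation 6 is ordinary least squares on lagged values, sketched below. The text does not say which standard technique was used, so this choice, the function names, and the synthetic series are assumptions.

```python
def fit_ar2(e):
    """Least-squares estimates (theta1, theta2) for e_t = theta1*e_{t-1} + theta2*e_{t-2} + u_t."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(e)):
        x1, x2, y = e[t - 1], e[t - 2], e[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * y
        b2 += x2 * y
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("lagged values are collinear; cannot estimate AR(2)")
    # Solve the 2x2 normal equations by Cramer's rule.
    theta1 = (b1 * s22 - b2 * s12) / det
    theta2 = (b2 * s11 - b1 * s12) / det
    return theta1, theta2


def ar2_residuals(e, theta1, theta2):
    """Filtered values u_t = e_t - theta1*e_{t-1} - theta2*e_{t-2}."""
    return [e[t] - theta1 * e[t - 1] - theta2 * e[t - 2] for t in range(2, len(e))]


# Synthetic noise-free series generated with theta1 = 0.5 and theta2 = -0.3.
series = [1.0, 0.5]
for _ in range(10):
    series.append(0.5 * series[-1] - 0.3 * series[-2])
theta1, theta2 = fit_ar2(series)
u = ar2_residuals(series, theta1, theta2)
```

On this synthetic series the fit recovers the generating parameters and the filtered values vanish; on measured data the filtered values are the $u_t$ passed on to the detection step.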

Now consider the prediction of threshold violations.
Current practice for problem detection is to
establish threshold values for measurement values. If the
observed value violates its threshold, an alarm is
raised. Threshold values are obtained from
historical data, such as the 95th quantile. 26
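The current practice just described can be sketched in a few lines. The nearest-rank definition of the quantile and the function names are our assumptions; the sample values are invented.

```python
def quantile_threshold(history, percent):
    """Nearest-rank quantile: smallest value with at least percent% of history at or below it."""
    ordered = sorted(history)
    rank = max(1, -(-percent * len(ordered) // 100))  # integer ceiling of percent*n/100
    return ordered[rank - 1]


def alarms(observations, threshold):
    """Indexes of the observations that violate (exceed) the threshold."""
    return [i for i, v in enumerate(observations) if v > threshold]


# Hypothetical history of 100 measurements and four new observations.
history = list(range(1, 101))
threshold = quantile_threshold(history, 95)
raised = alarms([90, 96, 95, 100], threshold)
```

The threshold here is a single number for all times of day, which is the property examined next.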

Unfortunately, there is a significant difficulty with
this approach in practice: normal load fluctuations
are so great that a single threshold is inadequate.
That is, a single threshold either results in an
excessive number of false alarms, or the threshold fails
to raise an alarm when a problem occurs. Some
performance management products attempt to
overcome this difficulty by allowing installations to
specify different thresholds at different times of the day,
day of the week, etc. However, requiring that
installations supply additional thresholds greatly adds to
the burden of managing these installations.

Prediction using change-point detection. We
propose here an approach in which we use the
characterization model to remove all known patterns in the
measurement data, including the time-serial
dependencies. For httpops in the Web server data, this
means using Equation 5 to remove “low frequency”
behavior, and then applying Equation 6 to the
residuals of this equation so as to remove time-serial
dependencies. The residuals of Equation 6
constitute filtered data for which all patterns in the
characterization have been removed. Last, a change-point
detection algorithm is applied to these filtered data
to detect anomalies, such as an increase in the mean
or the variance.

There are many algorithms for change-point
detection. 27 Herein, we use the GLR (Generalized
Likelihood Ratio) algorithm. This is an on-line technique
that examines observations in sequence rather than
en masse. When a change has been detected, an alarm
is raised.

First, we introduce some terminology. Recall that
$u_t$ is the $t$th residual obtained by filtering the raw
data using a characterization such as Equations 5 and
6. We consider two time windows, each a set of
time indexes at which data are obtained. The first
is the reference window; values in this window are used
to estimate parameters of the “null hypothesis” in
the test for a change point. The reference window
starts with the time at which the last change point
was detected; it continues through the current time
($t$). Within the reference window, $u_t$ has variance
$\sigma^2$. The second time window is the test window.
Values in this window are used to estimate parameters
of the “alternative hypothesis” that a change point
has occurred. The test window spans $t - L$ through
$t$. $L$ should be large enough to get a stable estimate
of $\sigma'^2$ (the variance of $u_t$ in the test window), but
small enough so that change points are readily
detected.
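The two windows can be turned into a simple likelihood-ratio test for a change in variance, sketched below. The text does not give the exact GLR statistic, so we assume zero-mean Gaussian residuals, for which twice the log-likelihood ratio over the test window reduces to n(r - 1 - ln r), where r is the ratio of the test-window variance to the reference-window variance and n is the test-window length. The function names and the alarm threshold are likewise assumptions.

```python
import math


def variance(values):
    """Variance estimate under the zero-mean assumption for filtered residuals."""
    return sum(u * u for u in values) / len(values)


def glr_statistic(reference, test):
    """Twice the Gaussian log-likelihood ratio for a variance change in the test window."""
    var_ref = variance(reference)
    var_test = variance(test)
    if var_ref == 0 or var_test == 0:
        raise ValueError("a window has zero variance; statistic is undefined")
    ratio = var_test / var_ref
    return len(test) * (ratio - 1.0 - math.log(ratio))


def detect_change(residuals, last_change, t, L, threshold):
    """Compare the test window [t - L, t] with the reference window [last_change, t]."""
    reference = residuals[last_change : t + 1]
    test = residuals[t - L : t + 1]
    return glr_statistic(reference, test) > threshold


# Synthetic residuals: 20 quiet points followed by a burst of larger variance.
quiet = [1.0, -1.0] * 10
burst = [5.0, -5.0, 5.0, -5.0]
alarm_quiet = detect_change(quiet, 0, 19, 3, 6.0)
alarm_burst = detect_change(quiet + burst, 0, 23, 3, 6.0)
```

The statistic is zero when both windows have the same variance and grows as they diverge, so a larger L gives a steadier estimate at the cost of a slower reaction, the tradeoff noted above.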

Empirical findings. We apply the foregoing approach
to the Web server data collected on July 15, 1996,
a day for which no anomaly is apparent. Figure 4A
displays httpops for this day. The vertical lines
indicate where change points are detected using the
GLR algorithm. Note that not taking into account
normal load fluctuations, as is often done in
practice, would have resulted in six alarms even though
no problem is apparent. Figure 4B plots the
residuals after using Equation 5 to filter the raw data and
Equation 6 to filter the residuals produced by it.
Observe that the GLR algorithm does not detect any
change point. Hence, taking normal behavior into
account enables our algorithm to reduce the
number of false alarms to only those cases where
abnormal deviations from the mean are authentic.

Predicting target events in production 
networks 

Our third study deals with the problem of short-term
predictions on categorical data (Table 1). The
prediction targets are computer-network events,
corresponding either to single hosts (e.g., CPU utilization
is above a critical threshold), or to a network (e.g.,
communication link is down). Monitoring systems
capture thousands of events in short time periods;
data analysis techniques may reveal useful patterns
characterizing network problems. 28

The task of predicting target events across a
computer network exhibits characteristics different from
the problems discussed in the last two sections
(predicting performance variables or predicting
threshold violations). A prediction here is an estimation
of a categorical or nominal value in the near future
(e.g., communication link will be down within five
minutes).
Figure 4 Change-points in data: (A) raw data, and (B) data after filtering



Target events and correlated events. We let the user
specify what events are of interest. For example, a
system administrator may wish to understand what
causes a printer error, or why a particular server is
not responding within a specified time threshold. We
refer to these events, such as all occurrences of a
printer error, as the set of target events. 29,30 We
assume the proportion of target events with respect to
all events is low; target events do not represent a
global property, such as periodicity or constant trend,
but rather a local property, such as a computer
attack on a host network.

Figure 5 shows the idea behind our predictive
algorithm. We look at those events occurring within a
time window of size W (user-defined) before a
target event. We are interested in finding sets of event
types, referred to from now on as eventsets, frequently
occurring before a target event. A solution to the
problem above is important to many real
applications. Understanding the conditions preceding a
system failure may pinpoint its cause. On the other
hand, anticipating a system failure enables us to
apply corrective actions before the failure actually
occurs. For example, an attack on a computer network
may be characterized by an infrequent but highly
correlated subsequence of events preceding the attack.

Technical approach. The problem of finding meaningful
eventsets preceding the occurrence of target events,
which we then use to build a model for prediction, can
be divided into three steps: (1) use associations to find
frequent eventsets within the time windows preceding
target events; (2) validate those eventsets against
events outside the time windows considered in step 1;
(3) build a rule-based model for prediction. We explain
each step next.

Figure 5 Target events and correlated events

Finding frequent eventsets. Our first step makes use
of mining of frequent itemsets as follows. Consider
a single target event e*. The conditions preceding
e* can be characterized by simply recording the event
types within a window of size W. For example, if each
target event is preceded by four different events
within a window of size W, then each window can
be represented as an event transaction T made of
four event types (e.g., $T = \{e_1, e_2, e_3, e_4\}$). Note
that it is admissible for consecutive target events to
generate time windows that overlap.

The procedure above can be applied over all occurrences
of target events to generate a set of event transactions
D. More specifically, our algorithm makes one pass
through the sequence of events, which we assume to be in
increasing order along time. With each new event, the
current time is updated; the algorithm keeps in memory
only those events within a time window of size W from
the current time. If the current event is a target event,
the set of event types contained in the most recent time
window becomes a new transaction in D. Finally, we
use association-rule mining 31 to find large eventsets,
that is, eventsets with frequency above minimum support
(e.g., with the Apriori algorithm). Our work is in some
sense related to the area of sequential mining, 11 ' 14 
in which traditional association mining is extended 
to search for frequent subsequences. 
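
The one-pass construction of D can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the (timestamp, event type) tuple layout, and the choice to leave target events out of the windows are assumptions. For brevity, frequent eventsets are counted by direct enumeration up to a small size rather than by the Apriori algorithm.

```python
from collections import Counter, deque
from itertools import combinations


def build_transactions(events, target_type, window):
    """One pass over time-ordered (timestamp, event_type) pairs.

    Returns one transaction (a frozenset of event types) per target event,
    holding the types seen within `window` time units before it.
    Assumption: target events themselves are not stored in the window.
    """
    recent = deque()          # events still inside the sliding window
    transactions = []
    for timestamp, event_type in events:
        # Forget events that fell out of the window ending at `timestamp`.
        while recent and recent[0][0] < timestamp - window:
            recent.popleft()
        if event_type == target_type:
            transactions.append(frozenset(e for _, e in recent))
        else:
            recent.append((timestamp, event_type))
    return transactions


def frequent_eventsets(transactions, min_support, max_size=3):
    """Support of every eventset of up to `max_size` types (brute force)."""
    counts = Counter()
    for tx in transactions:
        items = sorted(tx)
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {z: c / n for z, c in counts.items() if c / n >= min_support}
```

Because each transaction is a set, ordering and interarrival times inside a window are discarded by construction, and overlapping windows for consecutive target events are allowed.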

Note that the ordering of events and the interarrival
times between events within each time window are not
relevant. This is useful when an eventset occurs under
different permutations, and when interarrival times
exhibit high variation (i.e., signals are noisy).
These characteristics are present in many domains,
including the real production network used for our
experiments. For example, we observed that a
printer-network problem may generate a set of events
under different permutations and with interarrival-time
variation on the order of seconds. Our approach to
overcoming these uncertainties is to collect all event
types falling inside the time windows preceding target
events, which can then be treated simply as database
transactions.

Validating eventsets or patterns. For a target event 
such as “host A is down,” an example of an eventset 
Z frequently occurring before the target event is “low 
response time and high CPU utilization.” We refer 
to Z as a pattern. We may associate a pattern Z with 
the occurrence of the target event if Z does not oc¬ 
cur frequently outside the time windows preceding 
target events. Otherwise Z would appear as the re¬ 
sult of background noise, or of some global prop¬ 
erty of the whole event sequence. For example, if 
“low response time” is constant through time, it can¬ 
not be used for prediction. 


We start by computing the confidence of each eventset
or pattern, filtering out those below a minimum
degree of confidence. Confidence is an estimate
of the conditional probability of Z belonging to a
time window that precedes a target event, given that
Z matches the event types in that same time window.
Specifically, if D is the database capturing all
eventsets preceding target events, then let D' be
defined as the complement database capturing all
eventsets occurring within time windows of size W
not preceding target events. Let $x_1$ and $x_2$ be the
number of transactions in D and D', respectively,
matched by eventset Z. We eliminate all Z below
a minimum confidence level, where confidence is
defined as follows:

$$ \mathrm{confidence}(Z, D, D') = \frac{x_1}{x_1 + x_2} \tag{7} $$
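
As a small illustration of Equation 7, the sketch below counts matches in the two databases. It assumes that an eventset "matches" a transaction when it is a subset of the transaction's event types; the function name and the set-based representation are illustrative only.

```python
def confidence(eventset, d_pos, d_neg):
    """Equation 7: x1 / (x1 + x2).

    d_pos: transactions from windows preceding target events (D).
    d_neg: transactions from windows not preceding target events (D').
    Assumption: eventset Z matches a transaction when Z is a subset of it.
    """
    x1 = sum(1 for tx in d_pos if eventset <= tx)
    x2 = sum(1 for tx in d_neg if eventset <= tx)
    if x1 + x2 == 0:
        return 0.0  # Z never occurs, so it carries no evidence either way
    return x1 / (x1 + x2)
```

For example, an eventset matched by two transactions in D and one in D' has confidence 2/3.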

In addition, our filtering mechanism performs one
more test to validate an eventset. The reason is that
confidence alone is not sufficient to guarantee that
the probability of finding an eventset Z within
database D is significantly higher than the
corresponding probability in D'; confidence does not
check for negative correlations. 32 Thus, we add a
validation step described as follows.

Let P(Z|D) denote the probability of Z occurring
within database D, and P(Z|D') the corresponding
probability within D'. Eventset Z is validated if we
can reject the null hypothesis

$$ H_0 : P(Z \mid D) = P(Z \mid D') \tag{8} $$

with high confidence. If the number of events is large, 
one can assume a Gaussian distribution and reject 
the null hypothesis in favor of the alternative hypoth¬ 
esis 

$$ H_1 : P(Z \mid D) > P(Z \mid D') \tag{9} $$

if, for a given significance level $\alpha$, the difference
between the two probabilities (normalized to obtain a
standard normal variate) exceeds $z_\alpha$ standard
deviations. In such case we reject $H_0$. The probability
of this happening when $H_0$ is actually true is
$\alpha$. By choosing a small $\alpha$ we can be almost certain
that Z is related to the occurrence of target events.
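
One common way to carry out such a test is a one-sided two-proportion z-test. The text does not state the exact normalization, so the pooled standard error used below is an assumption, as are the function names and the default level of 0.05.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(x1, n1, x2, n2):
    """Normalized difference between P(Z|D) = x1/n1 and P(Z|D') = x2/n2.

    Assumption: pooled standard error, as in the usual two-proportion z-test.
    Returns None when the standard error is zero (Z matches nothing or all).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return None
    return (p1 - p2) / se


def is_validated(x1, n1, x2, n2, alpha=0.05):
    """Reject H0 in favor of H1 (P(Z|D) > P(Z|D')) at level alpha."""
    z = z_statistic(x1, n1, x2, n2)
    if z is None:
        return False
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return z > z_alpha
```

Because the test is one-sided, an eventset that is less frequent before target events than elsewhere (a negative correlation) yields a negative statistic and is never validated.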

In summary, our validation phase ensures that the
probability of an eventset Z appearing before a target
event is significantly larger than the probability
of Z appearing outside the time windows preceding
target events. The validation phase discards any
negative correlation between Z and the occurrence of
target events. In addition, this phase serves as a
filtering step to reduce the number of candidate
patterns used to build a rule-based model for
prediction.

Figure 6 Error of rule-based model vs size of window preceding target events

Rule-based system for prediction. Our last step uses 
the set of validated patterns to build a rule-based 
system for prediction. Previous work exists combin¬ 
ing associations with classification. 1518 This work dif¬ 
fers in the temporal nature of the data, and in the 
nature of the rule-based system. 

The rationale behind our rule-based system is to find
the most accurate and specific rules first. 23 Our
assumption of having a large number of available
eventsets and few target events obviates the need to
ensure that each example is covered by a rule.
Specifically, our algorithm sorts all eventsets
according to confidence (ties are resolved by larger
frequency and larger size). In general, other metrics
can be used to replace confidence, 33 such as
information gain, gini, or $\chi^2$. Starting with the
highest-confidence eventset $Z_i$, we eliminate all other
eventsets more general than $Z_i$. Eventset $Z_i$ is said
to be more general than eventset $Z_j$ if $Z_i \subset Z_j$.
For example, eventset {a, b} is more general than
eventset {a, b, c}. This step eliminates eventsets that
refer to the same pattern as $Z_i$ but are more general.
The resulting rule is of the form $Z_i \rightarrow$ target
event. The search then continues with the next
highest-confidence eventset, until no more eventsets
are left.
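
The selection loop described above can be sketched as follows. The input layout (a dictionary from eventset to a confidence and frequency pair) and the function name are assumptions made for illustration.

```python
def build_rules(candidates):
    """Greedy rule selection over validated eventsets.

    candidates: dict mapping frozenset (eventset) -> (confidence, frequency).
    Eventsets are ranked by confidence, then frequency, then size (all
    descending). Each selected eventset removes every remaining eventset
    that is a proper subset of it, i.e., more general than it.
    """
    ranked = sorted(
        candidates,
        key=lambda z: (-candidates[z][0], -candidates[z][1], -len(z)),
    )
    rules = []
    while ranked:
        best = ranked.pop(0)
        rules.append(best)  # rule: best -> target event
        ranked = [z for z in ranked if not z < best]
    return rules
```

With candidates {a, b, c} at confidence 0.9 and {a, b} at 0.8, only {a, b, c} survives, since {a, b} is a proper subset of the higher-ranked eventset.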

The final rule-based system $\mathcal{R}$ can be used for
prediction by checking for the occurrence of any of the
eventsets in $\mathcal{R}$ along the event sequence set apart
for testing. The model predicts finding a target event
within a time window of size W after any such eventset
is detected.
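
A minimal sketch of this prediction step follows. It assumes that an eventset counts as detected when all of its event types occur within a sliding window of the same size W; the text does not specify the detection window, and the names used here are illustrative.

```python
from collections import deque


def predict_alarms(events, rules, window):
    """Scan time-ordered (timestamp, event_type) pairs and raise alarms.

    An alarm at time t means: a target event is predicted within
    (t, t + window]. Assumption: a rule fires when all of its event types
    appear among the events of the last `window` time units.
    """
    recent = deque()
    alarms = []
    for timestamp, event_type in events:
        recent.append((timestamp, event_type))
        # The newest event is always inside the window, so recent is never empty.
        while recent[0][0] < timestamp - window:
            recent.popleft()
        present = {e for _, e in recent}
        if any(rule <= present for rule in rules):
            alarms.append(timestamp)
    return alarms
```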

Empirical findings. We report results obtained from 
a production computer network. Data were obtained 
by monitoring systems active during one month on 
a network having 750 hosts. One month of contin¬ 
uous monitoring generated over 26000 events, with 
165 different types of events. All events serve as in¬ 
put to the system. Our analysis concentrates on two 
types of target events labeled as critical by domain 
experts. The first type, URL (uniform resource
locator) Time-Out, indicates a Web site is
inaccessible. The second type, EPP (end-to-end probing
platform) Event, indicates that end-to-end response time
to a host generated by a probing mechanism is above 
a critical threshold. 

The first 50 percent of events serve for training and 
the other 50 percent serve for testing. Error is com¬ 
puted on the testing set only as follows. Starting at 
the beginning of the sequence, nonoverlapping time 
windows of size W that do not intersect the set of 
time windows preceding target events are considered 
negative examples; all time windows preceding tar¬ 
get events are considered positive examples. Error 
is defined as the fraction of examples incorrectly clas¬ 
sified by the rule-based model. 
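
The error computation can be sketched as below. Details such as half-open window boundaries, the name `window_error`, and the rule that a window is classified positive when any rule's event types all occur inside it are assumptions, not taken from the study.

```python
def window_error(events, target_times, rules, window):
    """Fraction of positive and negative windows misclassified by the rules.

    events: time-ordered (timestamp, event_type) pairs of the test sequence.
    Positive windows: [t - window, t) for each target time t.
    Negative windows: consecutive nonoverlapping windows from the start of
    the sequence that do not intersect any positive window.
    """
    positives = [(t - window, t) for t in target_times]
    negatives = []
    if events:
        start, end = events[0][0], events[-1][0]
        while start + window <= end:
            lo, hi = start, start + window
            if not any(lo < p_hi and p_lo < hi for p_lo, p_hi in positives):
                negatives.append((lo, hi))
            start += window

    def fires(w):
        present = {e for t, e in events if w[0] <= t < w[1]}
        return any(rule <= present for rule in rules)

    false_negatives = sum(1 for w in positives if not fires(w))
    false_positives = sum(1 for w in negatives if fires(w))
    total = len(positives) + len(negatives)
    return (false_negatives + false_positives) / total if total else 0.0
```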

We investigate the effect of varying the time window 
preceding target events on the error of the rule-based 
model. In all our experiments, the error correspond¬ 
ing to false positives is small (<0.1) and does not 
vary significantly while increasing the time windows. 
We focus on the false negative error rate, defined as
the proportion of times the rule-based model fails
to predict a true target event. Figure 6A shows our 
results when the target event corresponds to URL 
Time-Out on a particular host. With a time window 
of 300 seconds, the error is 0.39 (9/23). But as the 
time window increases, the error decreases signif¬ 
icantly. Evidently, larger time windows enable us to 
capture more information preceding target events. 
Figure 6B shows our results with a different target
event: EPP Event on a particular host. With a time
window of 300 seconds the error is as high as 0.83 
(9/62). Increasing the window to 2000 seconds brings 
the error rate down to 0.16. Our results highlight the 
importance of the size of the time window preced¬ 
ing target events in order to capture relevant pat¬ 
terns. 

We also investigate the effect of having a warning
window (also called a safe window) before each target
event, in case the rule-based model is used in a
real-time scenario where corrective actions need time
to take place. In this case, the algorithm does not
capture any events within the safe window while
characterizing the conditions preceding target events.
Our results show that performance degrades when the
safe window is incorporated, but only to a small
degree. On the EPP Event, for example, a time window of
300 seconds and a safe window of 60 seconds produces
the same amount of error as when the safe window is
omitted.

Conclusions 

In this study of predictive algorithms, we establish 
a distinction between short- and long-term predic¬ 
tions and between numeric and categorical data. We 
describe three case studies corresponding to the fol¬ 
lowing scenarios: (1) long-term prediction of perfor¬ 
mance variables, (2) short-term prediction of abnor¬ 
mal behavior, and (3) short-term prediction of system 
events. Empirical results show how predictive algo¬ 
rithms can be successfully employed in the estima¬ 
tion of performance variables and critical events. 

Future work will look at possible ways to unify the 
mechanism behind predictive algorithms to enrich 
our understanding of their applicability. For exam¬ 
ple, we note that problems characterized by numer¬ 
ical data can be converted into categorical data and 
vice versa. Aggregating events over fixed time inter¬ 
vals converts categorical data into numerical data. 
For example, workload in a server is computed by 
aggregating the number of site requests over fixed 
time units. Conversely, thresholding can be used to
transform numbers into categories. For example, a
measure of the end-to-end response time of a service
request, such as ping or mail probe, is often
categorized as normal or abnormal, depending on
whether the end-to-end response time exceeds a
predefined threshold.
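
Both conversions are simple to express in code. The sketch below is illustrative only; the function names, the label strings, and the assumption of non-negative timestamps are not from the study.

```python
from collections import Counter


def aggregate_counts(timestamps, interval):
    """Categorical -> numerical: count events per fixed time interval.

    Assumes non-negative timestamps; bin i covers
    [i * interval, (i + 1) * interval).
    """
    counts = Counter(int(t // interval) for t in timestamps)
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def threshold_labels(values, threshold):
    """Numerical -> categorical: label each value against a threshold."""
    return ["abnormal" if v > threshold else "normal" for v in values]


# Six request timestamps (seconds) aggregated into one-minute workload bins.
workload = aggregate_counts([1, 2, 65, 130, 131, 132], 60)
# Response times (seconds) categorized against a 1-second threshold.
labels = threshold_labels([0.2, 1.5, 0.9], 1.0)
```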

The transformation above can extend the applica¬ 
bility of predictive algorithms. For example, aggre¬ 
gating the number of times a host is down over fixed 
time intervals enables us to analyze the trend and 
seasonal variation of the host-down frequency. Since 
there may be occasions in which bringing a host down 
is part of scheduled maintenance, the same trans¬ 
formation can be used to detect if host-down fre¬ 
quency falls into abnormal behavior. In the exam¬ 
ple above, all three predictive algorithms described 
in previous sections can play an important role. Our 
goal is to develop tools to transform the input data 
so as to enable the use of different predictive
algorithms. The result would increase the amount of
information available to determine the root cause of
a problem and the amount of evidence to perform
accurate predictions.